Cellular Genetic Programming with Bagging and Boosting for the Data Mining Classification task
Authors
Abstract
Genetic programming (GP) [14] is a general-purpose method that has been successfully applied to solve problems in different application domains. In the data mining field [8], GP has proved to be a particularly suitable technique for the task of data classification [12, 9, 10, 11] by evolving decision trees. Many data mining applications manage databases consisting of a very large number of objects, each with several attributes. This huge amount of data (gigabytes or even terabytes) is too large to fit into the memory of computers, and thus causes serious problems in the construction of predictors such as decision trees [15]. One approach is to partition the training data into small subsets, obtain an ensemble of predictors, one trained on each subset, and then use a voting classification algorithm to predict the class label of new objects [5, 4, 6]. Bagging [2] is one of the best-known ensemble techniques; it builds bags of data of the same size as the original data set by applying random sampling with replacement. More complex techniques such as boosting [13] and arcing [3] adaptively change the distribution of the sample depending on how difficult each example is to classify. Bagging, boosting, and their variants have been studied and compared, and shown to be successful in improving the accuracy of predictors [7, 1]. These techniques, however, require that the entire data set be stored in main memory, which can be impractical for large data sets. In this case, data reduction through partitioning of the data set into smaller subsets seems a good approach, though an important aspect to consider is which kind of partitioning has the minimal impact on the accuracy of the results. Furthermore, to speed up the overall predictor-generation process, a parallel implementation of bagging is a natural choice. Cellular genetic programming (CGP) for data classification was proposed in [9].
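The bagging procedure described above — sampling with replacement, training one predictor per bag, and combining them by voting — can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: the trivial majority-class "stump" learner is a hypothetical stand-in for the GP-evolved decision trees.

```python
import random
from collections import Counter

def make_bags(data, n_bags):
    # Bagging: each bag is drawn by random sampling with replacement
    # and has the same size as the original training set.
    return [random.choices(data, k=len(data)) for _ in range(n_bags)]

def train_stump(bag):
    # Hypothetical stand-in for a GP-evolved decision tree: a constant
    # predictor that returns the bag's majority class.
    majority = Counter(label for _, label in bag).most_common(1)[0][0]
    return lambda x: majority

def vote(predictors, x):
    # Combine the ensemble's outputs by unweighted majority voting.
    return Counter(p(x) for p in predictors).most_common(1)[0][0]

data = [(i, "A" if i < 7 else "B") for i in range(10)]
ensemble = [train_stump(bag) for bag in make_bags(data, n_bags=5)]
prediction = vote(ensemble, x=3)
```

Because each bag is a resample of the full set, each predictor sees a slightly different view of the data; the vote then smooths out the variance of the individual predictors.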
The method uses cellular automata as a framework to enable a fine-grained parallel implementation of GP through the diffusion model. In this paper we present an extension of cellular genetic programming for data classification that induces an ensemble of predictors, each trained on a different subset of the overall data, and then combines them to classify new tuples by applying different voting algorithms, such as bagging and boosting. Preliminary results on a large data set show that the ensemble of classifiers trained on subsets of the data achieves higher accuracy than a single classifier that uses the entire data set.
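The adaptive reweighting that distinguishes boosting from bagging can be illustrated with one AdaBoost.M1-style round. This is an assumption for illustration — the boosting variant cited as [13] may differ in its exact update rule.

```python
def boosting_round(weights, correct):
    # One AdaBoost.M1-style reweighting step. correct[i] records whether
    # the current predictor classified training example i correctly.
    error = sum(w for w, ok in zip(weights, correct) if not ok)
    beta = error / (1.0 - error)  # assumes 0 < error < 0.5
    # Down-weight the correctly classified (easy) examples ...
    reweighted = [w * beta if ok else w for w, ok in zip(weights, correct)]
    # ... then renormalize so the weights form a distribution again,
    # concentrating mass on the hard-to-classify examples.
    total = sum(reweighted)
    return [w / total for w in reweighted]

# Four equally weighted examples; the last one is misclassified,
# so the next predictor's sample will favor it.
w = boosting_round([0.25] * 4, [True, True, True, False])
```

After this round the misclassified example carries half of the total weight, which is exactly the "distribution depending on how difficult each example is to classify" behavior described above.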
Similar resources
Improving reservoir rock classification in heterogeneous carbonates using boosting and bagging strategies: A case study of early Triassic carbonates of coastal Fars, south Iran
An accurate reservoir characterization is a crucial task for the development of quantitative geological models and reservoir simulation. In the present research work, a novel view is presented on the reservoir characterization using the advantages of thin section image analysis and intelligent classification algorithms. The proposed methodology comprises three main steps. First, four classes of...
Combining Bagging and Boosting
Bagging and boosting are among the most popular resampling ensemble methods that generate and combine a diversity of classifiers using the same learning algorithm for the base classifiers. Boosting algorithms are considered stronger than bagging on noise-free data. However, there are strong empirical indications that bagging is much more robust than boosting in noisy settings. For this reason, i...
Advanced Methodologies Employed in Ensemble of Classifiers: A Survey
If we look a few years back, we will find that the ensemble classification model has sparked much research and publication in the data mining community, discussing how to combine models or model predictions with a reduction in the resulting error. When we ensemble the predictions of more than one classifier, more accurate and robust models are generated. We have convention that bagging, boosting wit...
Bagging, Boosting, and Bloating in Genetic Programming
We present an extension of GP (Genetic Programming) by means of resampling techniques, i.e., Bagging and Boosting. These methods both manipulate the training data in order to improve the learning algorithm. In theory they can significantly reduce the error of any weak learning algorithm by repeatedly running it. This paper extends GP by dividing a whole population into a set of subpopulations, e...
Ensemble of M5 Model Tree Based Modelling of Sodium Adsorption Ratio
This work reports the results of four ensemble approaches with the M5 model tree as the base regression model to anticipate Sodium Adsorption Ratio (SAR). Ensemble methods that combine the output of multiple regression models have been found to be more accurate than any of the individual models making up the ensemble. In this study additive boosting, bagging, rotation forest and random subspace...
Publication date: 2003